{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Active Learning Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** This code utilizes functions from an older version of the Label Studio SDK (v0.0.34). While the newer versions v1.0 and above still support the functionalities of the old version (see `label_studio_sdk._legacy` for reference), we recommend using the latest Label Studio SDK v1.0 or higher.\n",
    "\n",
    "Active learning is a branch of machine learning that seeks to minimize the total amount of data required for labeling by strategically sampling data that provides insight into the problem you're trying to solve so that you can focus on labeling that data.\n",
    "\n",
    "Follow this example to write a Python script using the [Label Studio SDK](https://labelstud.io/sdk/index.html) that performs active learning with a text classification machine learning model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install label-studio-sdk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start by configuring the connection to the Label Studio API. You can retrieve your API key from your user profile in Label Studio. In your script, write the following, replacing the **LABEL_STUDIO_URL** and **LABEL_STUDIO_API_KEY** with your own: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LABEL_STUDIO_URL = 'https://app.heartex.com'\n",
    "LABEL_STUDIO_API_KEY = '096ef33e96d6c97b522a58c854ae4d592600c130'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, import the Client module from the Label Studio SDK to make sure that you successfully connected to the API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from label_studio_sdk import Client\n",
    "\n",
    "ls = Client(url=LABEL_STUDIO_URL, api_key=LABEL_STUDIO_API_KEY)\n",
    "ls.check_connection()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After connecting to the Label Studio API with the SDK, create a project in Label Studio to perform active learning with your data labeling tasks. This project performs sentiment analysis for a passage of text. See the [sentiment analysis template](https://labelstud.io/templates/sentiment_analysis.html) for more. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from label_studio_sdk._legacy.project import ProjectSampling\n",
    "\n",
    "project = ls.start_project(\n",
    "    title='AL Project Created from SDK - DEMO',\n",
    "    label_config='''\n",
    "    <View>\n",
    "    <Text name=\"text\" value=\"$text\"/>\n",
    "    <Choices name=\"sentiment\" toName=\"text\" choice=\"single\" showInLine=\"true\">\n",
    "        <Choice value=\"Positive\"/>\n",
    "        <Choice value=\"Negative\"/>\n",
    "        <Choice value=\"Neutral\"/>\n",
    "    </Choices>\n",
    "    </View>\n",
    "    '''\n",
    ")\n",
    "project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In an active learning scenario, you want to label the tasks with the lowest machine learning model prediction scores first. You can set up **uncertainty sampling** for your tasks to automatically reorder tasks by prediction score, from low to high."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.set_sampling(ProjectSampling.UNCERTAINTY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up an example machine learning model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This examples uses a simple TF-IDF Text Classification model built on the [`scikit-learn` API](https://scikit-learn.org/stable/). To perform active learning with this model, we must be able to retrain model weights and make inferences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "labels_map = {\n",
    "    'Positive': 0,\n",
    "    'Negative': 1,\n",
    "    'Neutral': 2\n",
    "}\n",
    "inv_labels_map = {idx: label for label, idx in labels_map.items()}\n",
    "\n",
    "\n",
    "def get_model():\n",
    "    # Initialize model with random weights\n",
    "    return make_pipeline(TfidfVectorizer(), LogisticRegression(C=10, verbose=True))\n",
    "\n",
    "\n",
    "def train_model(model, input_texts, output_labels):\n",
    "    # Train the model, given a list of input texts and output labels\n",
    "    model.fit(input_texts, [labels_map[label] for label in output_labels])\n",
    "\n",
    "\n",
    "def get_model_predictions(model, input_texts):\n",
    "    # Make model inference and return predicted labels and associated prediction scores\n",
    "    probabilities = model.predict_proba(input_texts)\n",
    "    predicted_label_indices = np.argmax(probabilities, axis=1)\n",
    "    predicted_scores = probabilities[np.arange(len(predicted_label_indices)), predicted_label_indices]\n",
    "    return [inv_labels_map[i] for i in predicted_label_indices], predicted_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Perform active learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collect the annotated tasks from Label Studio so that you can use them to train the model. Each task is stored in [Label Studio JSON format](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks), with `\"text\"` field used as input and `\"choices\"` annotation field to store output label:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labeled_tasks = project.get_labeled_tasks()\n",
    "if not labeled_tasks:\n",
    "    print(f'No labeled tasks found in project \"{project.title}\" (id={project.id}).\\nOpen {LABEL_STUDIO_URL}/projects/{project.id} and annotate your data first!')\n",
    "else:\n",
    "    texts, labels = [], []\n",
    "    for labeled_task in labeled_tasks:\n",
    "        texts.append(labeled_task['data']['text'])\n",
    "        labels.append(labeled_task['annotations'][0]['result'][0]['value']['choices'][0])\n",
    "    print(f'Found {len(texts)} annotated texts and {len(set(labels))} classes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Update the model weights based on the annotated tasks. We use `\"train_model()\"` function from example machine learning described above, but in principle it could be any other classifier trainer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_model()\n",
    "train_model(model, texts, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After collecting the annotated tasks, collect the unlabeled tasks so that the machine learning model can make predictions. Because there can be a large number of unlabeled tasks, you can sample them and retrieve only a small subset of data. In this case, collect a random sample of 100 unlabeled tasks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "unlabeled_tasks_ids = project.get_unlabeled_tasks_ids()\n",
    "batch_ids = random.sample(unlabeled_tasks_ids, 100)\n",
    "unlabeled_tasks = project.get_tasks(selected_ids=batch_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the subset of unlabeled tasks that you collected, you can make model inferences to get the predictions from the text classification model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "texts = [task['data']['text'] for task in unlabeled_tasks]\n",
    "pred_labels, pred_scores = get_model_predictions(model, texts)\n",
    "\n",
    "# Example:\n",
    "for i, (text, labels, scores) in enumerate(zip(texts, pred_labels, pred_scores)):\n",
    "    print(f'{text} --> {labels} [{scores}]')\n",
    "    print('===============================')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Send predictions to Label Studio"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the model makes its predictions, return the predictions to Label Studio so that annotators can review and update them. \n",
    "\n",
    "Define a model version to identify the latest batch of predictions, in this example based on the amount of data used to retrain the model, but you can use any arbitrary unique name. Setting model version is optional step, but in Active Learning scenario, it helps you to control which model to show in the next iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_version = f'model_{len(labeled_tasks)}'\n",
    "print(model_version)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Format the predictions and add them to each task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = []\n",
    "for task, pred_label, pred_score in zip(unlabeled_tasks, pred_labels, pred_scores):\n",
    "    predictions.append({\n",
    "        'task': task['id'],\n",
    "        'result': [{\n",
    "            'from_name': 'sentiment',\n",
    "            'to_name': 'text',\n",
    "            'type': 'choices',\n",
    "            'value': {\n",
    "                'choices': [pred_label]\n",
    "            }\n",
    "        }],\n",
    "        'score': pred_score,\n",
    "        'model_version': model_version\n",
    "    })\n",
    "    \n",
    "project.create_predictions(predictions)\n",
    "print(f'{len(pred_labels)} tasks have been preannotated with model predictions')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, update the Label Studio settings to use the newly-created model version when performing uncertainty sampling and displaying pre-annotated tasks to annotators:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.set_model_version(model_version)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it. Now you can open your project page, check **Predictions** and **Prediction score** column "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! The Label Studio SDK makes creating an active learning loop that is easily repeatable with a script. You can run this script on a regular cadence, or use [Label Studio Webhooks](https://labelstud.io/guide/webhooks.html) to perform event-driven active learning. See more about [active learning in Label Studio](https://labelstud.io/guide/active_learning.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "labelops",
   "language": "python",
   "name": "labelops"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
